-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SIMD 0072: Feature Gate Threshold Automation #72
base: main
Are you sure you want to change the base?
SIMD 0072: Feature Gate Threshold Automation #72
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for submitting this SIMD - I'd like to ask about some clarification.
On first read, it wasn't immediately clear to me whether this change will install more human approval or less. Would it be possible to clarify in the summary whether:
- Validator software will automatically signal all of its supported features when started, without further prompt from the operator.
- Validator operators will now have to manually arm for feature gates via a validator CLI command.
The summary mentions that the process is automated and manual human action is removed, which would imply 1. But further below it mentions that the operator runbook now includes the new transaction step, which would imply 2.
FWIW, I oppose mechanism 1 because it'd make it virtually impossible to deploy a Labs-independent client safely. For various reasons, I expect Firedancer (validator software) to take a bit longer to implement new feature gates, and have a small minority of stake for the forseeable future. I'm worried that in mechanism 1, Labs nodes would just overrule other nodes all the time, and activate features that are not supported by all implementations.
@ripatel-fd : However, the more fundamental principle of this proposal is that actual activation of the feature is automatic based on percentage of stake support, default 95%. While the Labs client no longer represents anywhere near 95% of stake, it's true that no one entity or client implementation (including Firedancer) would be able to block activation en masse until they represent 5+% of stake. Even if this SIMD is approved and implemented, feature-gate key-holders can continue to negotiate with all the clients to determine when to queue the feature for activation, as we do now -- and we envision separate SIMDs around formal governance on approving/queuing proposed features. So this SIMD wouldn't prevent the level of human control that currently exists. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this proposal is a much needed one. As I have mentioned though, in my comments, I think we need a separate SIMD for the new program and everything about how it will work.
I also think some consideration as to how this can tie into a governance-based mechanism would be good. For example, governance could approve the creation of new features, the description of such a vote would specify exactly what the new feature will do.
Good stuff so far!
We (@ripatel-fd and I) spent some more time discussing this proposal and we had several more comments on it:
We propose that there is 3 phase system for feature gate ratification:
We would be happy to collaborate on proposal(s)/specification(s) which go in the above direction. As it stands, it introduces the aforementioned bootstrapping issue, which would give no recourse for the community to dissent to new features. This could prolong the development of clients with less resources. More importantly than our own concerns though (which are admittedly self-interested) this proposal would further Solana Labs' hegemony on the chain via their control of client features. This may be against the wants of validators who actually control the blockspace of Solana. It should not be the version of the Solana Labs client or the clients implicit capabilities which control the blockspace, it must be the people. |
@lheeger-jump , thanks for your thoughts. To clarify, the system we envision is 2 phases:
To me, these two steps are synonymous to the Version Availability and Activation steps you describe, with the same essential stake and participation requirements, just in the opposite order. Can you please expand on how changing the order of these steps changes the amount of control any particular client wields? |
Thanks for clarifying @CriesofCarrots. I should have read the SIMD more carefully. I think we are ideologically aligned in that this is only a "feature support discovery" mechanism, and that the actual activation logic can be specified in a separate PR. This is definitely a step in the right direction. The above criticism regarding risk to another clients that me and @lheeger-jump posted is therefore invalid. We had some thoughts about how the permissioned part can be removed (adding new features to the queue). It would involve changing the negotiation API like so:
Example API enum FeatureInstruction {
/// Initializes a feature account
/// # Account references
/// 0. `[writable]` The feature account
/// 1. `[signer]` The feature authority
Initialize,
/// Closes a feature account without nominating it
/// # Account references
/// 0. `[writable]` The feature account
/// 1. `[signer]` The feature authority
ClosesFeature,
/// Advertise support for feature. Can only be called by current leader, once per rotation.
/// # Account references
/// 0. `[writable]` The feature account
/// 1. `[signer]` Current leader
LeaderAdvertise,
/// Nominates a feature for a governance vote.
/// # Account references
/// 0. `[writable]` The feature account
/// 1. `[signer]` The feature authority
Nominate,
} The disadvantage of this change is that it increases the amount of transactions and data required to activate a feature. |
This should be more clear in the SIMD and should be very explicit. Thanks for the clarification. Nonetheless, the arming mechanism described is relevant to the discussion. |
The 3 steps I described are to allow anyone to create a new feature, but not to allow them to be able to be the activator (that would be governance). I think what I am saying is that anyone can do creation (i.e. anyone can propose new features) and that is separate from activation which should not be voted on. I think you basically say that here:
By splitting up the first step into creation and then activation after version availability is satisfied, you disallow the feature owner from ramming the feature on and the choice to enable is explicit. Less about client control. The client control is more about a lack of governance on feature activation. |
546b3ab
to
d1817fa
Compare
ca3d674
to
2497aee
Compare
Hey everyone! It's time to dust this one off again now that we have #88 and #89 merged! We've introduced a completely revised version of this proposal, re-written from scratch. It should assess all of the discussion points raised above in previous threads, as well as numerous concerns raised in various offline discussions. I'd love it if we can open back up the review process for this one, and start working toward a new feature activation process! @CriesofCarrots @lheeger-jump @ripatel-fd P.S. I've included a link to the Feature Gate program's reference implementation in the PR description. |
@ripatel-fd and I can take another look in the next few days |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the general approach!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good, added some clarifying comments/questions, the answers to which would be good to have in the SIMD text.
Basically trying to be explicit about all assumptions/details, so that a client can implement this SIMD using just this SIMD.
Nice one!
a643bdf
to
177b45a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One substantive question, otherwise mostly suggestions for clarity.
Transactions sent too late (< 4500 slots before the epoch end) will be rejected | ||
by the Feature Gate program. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need any deadline/restriction? I see the program cache being alluded to, but I don't see any specification that changes to the Staged Features PDA must begin or conclude at some point earlier than the epoch boundary. It's not obvious why the Feature Gate program can't continue to modify the Staged Features PDA up to the epoch boundary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you suggesting to explicitly explain the reasoning for this constraint in the SIMD, or are you asking why we actually need such a constraint? In either case, I agree it should be more explicit in the proposal.
It seems to me that @Lichtso intends to increase the cache preparation phase's initiation to be sooner in the epoch (comment), so we probably need a more concrete explanation as to why the Feature Gate program should have this constraint.
In this proposal, it was initially set to 128 slots remaining in the epoch because the cache preparation stage begins at a maximum of 128 slots before the end of the epoch, as computed here.
It's worth calling out that if we configure this constraint in the Feature Gate program and then later need to adjust it for an earlier cache preparation start, we'll be bound to using a feature gate for that change to the program.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm asking why we need a constraint at all. The cache being referenced is supposedly an agave-specific implementation, not part of the runtime protocol. If feature activation needs to work around any aspect of cache behavior, that seems like (a) a problem; and (b) a thing that definitely needs to be specified for all clients.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's really nothing stopping the cache from being able to anticipate the upcoming feature environment by querying the PDA under the Feature Gate program that would tell it what the current stake support is for each feature. It's actually arguably equally or more deterministic than the way it's done now.
The cache could run the calculation at its current 128 slots before epoch boundary and probably get a pretty good understanding of which features have enough stake support. Increasing that slot buffer beyond 128 could just mean increasing the likelihood that not enough stake has been signaled for a feature. Perhaps this is a compromise the cache implementation should consider? After all, I believe recompilation is an optimization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a commit removing the "slots remaining" constraint, since I think the above is a sensible direction to head in. If there's explicit opposition, it can be added back in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for my imprecise formulation, I meant scheduled for activation.
If a feature is scheduled for activation in some slot close to the boundary, then it is possible that the TX is not yet rooted. Thus, some validator nodes will not have it scheduled, and will not activate it on the boundary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please explain how this is different from the cluster coming to consensus on any other tx/block on which other logic depends.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For features which only affect their own fork this is irrelevant. But for features which change validator global behavior this would break. And I am not sure if there is much to be gained from making exceptions or harving different feature types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just adding an additional point of clarity here, if it helps.
Right now, in Agave, the cache is calling compute_active_feature_set
, which will compute the feature set based on the on-chain accounts that have been created under Feature11111...
(ie. "activated"). The cache does this ~128 slots before the epoch boundary to prepare the cache entries for the anticipated upcoming environment.
With the proposed system in place, the cache can still call compute_active_feature_set
the same way it does now. The function will just compute the anticipated feature set from the proposed PDA (with stake support) rather than just the presence of on-chain accounts.
In almost every way it's the same as what we have now. The subtle difference is that someone can no longer activate a feature after the cache has done its feature-set computation, but nodes could still cast votes (signals) on a feature, bringing it to "eligible" status. This edge case exists today and just causes cache preparation/recompilation to be moot.
What @CriesofCarrots is suggesting is that we don't explicitly architect this mechanism around the cache, which is sensible, especially considering the cache preparation may change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Lichtso do you still have concerns here? I'd like to move this SIMD along.
4d2b500
to
5ca6da6
Compare
5ca6da6
to
58cb436
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple small comments on the new PDA wording. Just the question about "slots remaining" constraint to be dealt with, as far as I'm concerned.
|
||
The `SignalSupportForStagedFeatures` instruction is structured as follows: | ||
|
||
- Data: A `u8` bit mask of the staged features. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this also need to include the epoch? As the list of staged features changes epoch-to-epoch, we need a way of ensuring that the bit mask in each validator support signal transaction is interpreted for the correct epoch. If the transaction is delayed or is executed in a different epoch than the one it was sent, for whatever reason, then the support signal will be mis-interpreted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would guess the Staged Features PDA
supplied should indicate the intended epoch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's in the PDA for the validator's support signal. When they send a bitmask, it gets recorded in their signal PDA, and if they send again, it's overwritten. The current epoch is checked via Clock
sysvar when the program is invoked with this instruction.
This is SIMD 3/4 expected for Multi-Client Feature Gates. See #76
Summary
This SIMD outlines a proposal for automating the feature activation process
based on a stake-weighted support threshold, rather than manual human action.
With this new process, contributors no longer have to assess stake support for a
feature before activation. Instead, the assessment is done by the runtime.
A reference implementation for the Feature Gate program introduced in #89 and mentioned in this SIMD can be found here.